Zero-Shot Neural Decoding with Semi-Supervised Multi-View Embedding

Zero-shot neural decoding aims to decode image categories, which were not previously trained, from functional magnetic resonance imaging (fMRI) activity evoked when a person views images. However, having insufficient training data due to the difficulty in collecting fMRI data causes poor generalization capability. Thus, models suffer from the projection domain shift problem when novel target categories are decoded. In this paper, we propose a zero-shot neural decoding approach with semi-supervised multi-view embedding. We introduce the semi-supervised approach that utilizes additional images related to the target categories without fMRI activity patterns. Furthermore, we project fMRI activity patterns into a multi-view embedding space, i.e., visual and semantic feature spaces of viewed images to effectively exploit the complementary information. We define several source and target groups whose image categories are very different and verify the zero-shot neural decoding performance. The experimental results demonstrate that the proposed approach rectifies the projection domain shift problem and outperforms existing methods.


Introduction
Neural decoding has enabled the interpretation of a person's cognitive state from evoked brain activity. This attempt makes a significant contribution to the development of brain-computer interfaces (BCIs) [1] that would establish communication between computer systems and human brain activity. Several machine learning methods [2][3][4] have revealed what a person viewed from evoked brain activity. In these methods, functional magnetic resonance imaging (fMRI) activity patterns that were measured when a subject viewed several images (e.g., fish, chair, and face) were classified into a valid image category. Since these methods focused on the relationship between fMRI activity patterns and image categories used in the training phase, the predicted categories were restricted to trained categories only. It is not feasible to acquire fMRI activity patterns for all possible categories; therefore, the results of these methods are significantly limited.
To overcome this limitation, zero-shot neural decoding approaches attempt to decode, from the corresponding brain activity, viewed images [5][6][7][8][9][10] and meanings of presented words [11][12][13] that were not used in training. For example, the studies [6][7][8][14] estimated mappings between fMRI activity patterns and visual features that were extracted from the viewed images via convolutional neural networks (CNN) [15]. Then, visual features were predicted from fMRI activity measured when a subject viewed a novel category. Finally, the novel category was decoded by comparing the predicted features with visual features from various categories.
On the other hand, zero-shot learning, which aims to recognize novel classes without labeled training samples, suffers from the projection domain shift problem [16][17][18]. Zero-shot learning can be considered as transfer learning that transfers the knowledge from the training/source domain to the test/target domain. When two domains are potentially unrelated, the mappings learned from one source domain may not correctly capture the relationship of the target domain, and this is known as the projection domain shift problem (see Figure 1a). Furthermore, collecting brain activity patterns is very laborious since the measurement imposes a heavy burden on subjects. Thus, the number of source categories is small, and source categories may be limited to certain categories only. Therefore, the projection domain shift problem is pronounced in zero-shot neural decoding, resulting in poor decoding performance.
Figure 1. (a) The projection domain shift problem in zero-shot neural decoding. When the source categories (i.e., ostrich, kangaroo, and dolphin) and the target categories (i.e., mailbox and iPod) are potentially different, the projections that connect fMRI activity with the visual feature space are biased towards the source categories. This bias could keep target fMRI activity embeddings away from the actual target category embeddings. (b) Our proposed semi-supervised multi-view embedding. We embedded additional visual and semantic features extracted from images related to the target categories without fMRI activity patterns. By introducing additional visual and semantic features, the biased projections towards the source categories are compensated, and the projection domain shift problem is alleviated.
In this paper, we propose a zero-shot neural decoding approach with semi-supervised multi-view embedding. To rectify the projection domain shift problem, we introduce a semi-supervised framework that utilizes images related to the target categories without fMRI activity patterns. Our framework can be seen as domain adaptation relying on images from the target domain, which can be collected at a reasonable cost. Furthermore, we project fMRI activity patterns into a multi-view embedding space: the visual feature space and the semantic feature space. Specifically, visual features are extracted from images via CNN, and semantic features are extracted from image categories via a distributed word representation model. Our semi-supervised multi-view approach can consider different feature spaces that contain complementary information while introducing additional visual and semantic features extracted from images related to the target categories (see Figure 1b).
Most importantly, fMRI activity patterns corresponding to additional visual and semantic features are not necessary for the proposed method.
Towards the realization of our approach, we construct a semi-supervised multi-view generative model. Our proposed model assumes that fMRI activity, visual features, and semantic features are generated from a shared latent variable under a Bayesian framework [19,20]. Furthermore, additional visual and semantic features are incorporated into the model to solve the projection domain shift problem. We consider unobserved fMRI activity patterns corresponding to additional features as missing values and estimate these missing values while optimizing model parameters. Figure 2 demonstrates the embedding space in the previous method [6] and the proposed model by plotting visual and semantic features predicted from fMRI activity (we used fMRI activity collected from subject 3 in the publicly available fMRI dataset [6]) onto two dimensions using t-SNE [21]. In the visual feature space of the method [6] illustrated in Figure 2a, the target category embeddings of artifact are biased towards the source categories of living thing and are separated from their prototypes. On the other hand, this bias is alleviated in the visual feature space of our method, and target category embeddings of artifact are projected properly in the semantic feature space illustrated in Figure 2b. We solve the projection domain shift problem by the semi-supervised approach and consider a better feature space by the multi-view embedding approach.

Figure 2. (a) The visual feature space by the previous method [6] and (b) the visual and semantic feature spaces by the proposed model. As seen in Figure 1, source categories belong only to living thing, and target categories of living thing (red circles) and artifact (blue circles) are projected into the embedding space. Prototypes of living thing (red stars) and artifact (blue stars) represent the ground truth embeddings.
The main contributions of this paper are as follows:
• To the best of our knowledge, this is the first study to address the projection domain shift problem in zero-shot neural decoding. For this problem, we introduce the semi-supervised framework that employs images related to the target categories without fMRI activity patterns.
• We propose multi-view embeddings that associate fMRI activity patterns with visual features from images and semantic features from image categories.
• We address the difficulty in collecting fMRI data and estimate unobserved fMRI activity patterns in a fully probabilistic manner [22]. Furthermore, the Bayesian framework of our model automatically selects a small set of appropriate components from high dimensional fMRI voxels.

Multi-View Learning
Multi-view learning [23,24] is machine learning which considers learning with multiple views to improve the generalization performance and provide complementary information. Canonical correlation analysis (CCA) [25] is a classical but still powerful tool for analyzing multi-view paired samples. CCA finds projection matrices such that the correlation between projected paired samples in the shared latent space is maximized. Semi-supervised CCA (SemiCCA) [26] utilizes additional unpaired samples to prevent the performance degradation of CCA when the number of paired samples is limited. The CCA framework is also applied to neural decoding [7,14], where pairs of fMRI data and visual features of images that a subject viewed are associated. The method in [14] predicts visual features from fMRI data utilizing CCA to estimate the viewed image categories. The method in [7] introduces semi-supervised fuzzy discriminative CCA (Semi-FDCCA), where additional images not used in measuring fMRI data are utilized by SemiCCA and the similarity information of image categories is incorporated. However, one of the central problems in CCA is the model selection or how to select the dimensions to be retained [27]. In particular, since the number of dimensions of fMRI data is high, dimension reduction must be applied before input to the CCA, resulting in performance degradation.
The proposed model is constructed under the same framework as Bayesian canonical correlation analysis (BCCA) [19,27] and group factor analysis (GFA) [20]. BCCA is a generative model that links two observed variables via a shared latent variable, and GFA also treats more than two observed variables. BCCA and GFA introduce automatic relevance determination (ARD) [28] via the Bayesian approach, which automatically selects the appropriate dimensions from the data. Therefore, we can automatically select a small set of appropriate components from high dimensional fMRI voxels. The proposed model handles three observed variables (i.e., fMRI activity, visual features, and semantic features) and can be easily extended to semi-supervised learning. We also refer to the proposed model without the semi-supervised learning scenario as the multi-view generative model (MGM) [29].
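The classical CCA objective described above can be illustrated with a short NumPy sketch. This is a generic textbook formulation (whitening followed by a singular value decomposition), not the implementation used in [7,14] or in this paper; the function name `cca` and the ridge term `reg` are assumptions introduced here for numerical stability.

```python
import numpy as np


def cca(X, Y, n_components, reg=1e-6):
    """Classical CCA for paired samples (rows are samples).

    Returns projection matrices A, B and the canonical correlations.
    `reg` is a small ridge term (an assumption of this sketch).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    c_xx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    c_yy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    c_xy = X.T @ Y / (n - 1)

    def inv_sqrt(c):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, v = np.linalg.eigh(c)
        return v @ np.diag(1.0 / np.sqrt(w)) @ v.T

    k_x, k_y = inv_sqrt(c_xx), inv_sqrt(c_yy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    u, s, vt = np.linalg.svd(k_x @ c_xy @ k_y)
    A = k_x @ u[:, :n_components]
    B = k_y @ vt.T[:, :n_components]
    return A, B, s[:n_components]
```

Note that the covariance matrices have the size of the input dimensions, which is one reason why high-dimensional fMRI data typically require dimension reduction before CCA can be applied.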

Zero-Shot Learning
Zero-shot learning (ZSL) is the recognition of novel visual categories without labeled training samples. Inspired by humans' ability to recognize a new object category without ever seeing a visual instance, ZSL aims to recognize a visual instance of a new category that has never been seen before [17]. Common ZSL methods learn a projection function from a visual feature space of images to a semantic embedding space (e.g., a semantic attribute space or a semantic word vector space) using the labeled training data consisting of seen classes only. At test time, this projection function is used to project the visual representation of an unseen class image into the semantic embedding space. However, ZSL models mostly suffer from the projection domain shift problem [16]. That is, if the projection is learned only from the seen classes, the projections of unseen class images are likely to be misplaced (shifted) due to the bias of the seen classes [18]. Zero-shot neural decoding also suffers from the projection domain shift problem when projecting fMRI activity to the visual feature space and semantic feature space.

Neural Decoding
Neural decoding (or brain decoding) is the task of estimating images or words that a person sees or imagines from his/her brain activity (especially fMRI activity). In the traditional neural decoding methods [2][3][4], fMRI activity that was measured when a person viewed an image was classified into an image category, which enables the estimation of the viewed object. Subsequent neural decoding methods [5][6][7][8][9][10][11][12][13] estimate images or words that were not trained before, i.e., zero-shot neural decoding, by projecting fMRI activity into an intermediate-level feature space (e.g., visual feature space [6] and semantic feature space [12,13]). Recent neural decoding approaches [30][31][32][33] employ semi-supervised learning to utilize additional images without corresponding fMRI activity patterns. Beliy et al. [30] aimed to reconstruct high-quality images from fMRI activity evoked while viewing images. Their method utilizes unlabeled data (i.e., images without fMRI recording, and fMRI recording without images) to train fMRI-to-image reconstruction networks. Akamatsu et al. [31] and Du et al. [32] aimed to accurately estimate viewed image categories from evoked fMRI activity. They associate fMRI activity with visual features and semantic [31] or textual features [32] extracted from image categories. They also introduce semi-supervised learning by incorporating additional visual and semantic or textual features without fMRI activity. Liu et al. [33] proposed BrainCLIP, which unifies the visual stimulus classification and image reconstruction from fMRI activity. They leverage the semantic space of CLIP [34] learned from a large-scale vision-language corpus to perform neural decoding tasks. The method is pre-trained with large-scale unlabeled images without fMRI activity.
This paper is an extended version of the previous work [31]. This work aims to address the projection domain shift problem in zero-shot neural decoding, especially when the source domain and the target domain are potentially unrelated. Specifically, we define several source and target groups (e.g., mammal, device, and structure) and validate their decoding performances in zero-shot learning settings. In the previous works [31][32][33], image categories related to the target group were included in the source group (e.g., source: dolphin, airship, and beer mug; target: killer whale, airliner, and coffee mug, respectively). On the other hand, the target and source groups are very different in our setting; therefore, we are tackling a more challenging problem than in previous works.

Proposed Method
Our proposed method consists of the following three phases: the construction of the semi-supervised multi-view generative model, the optimization of the model parameters, and the decoding of viewed image categories from evoked fMRI activity.

Semi-Supervised Multi-View Generative Model
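Suppose that $X^{(f)} = [x^{(f)}_1, \ldots, x^{(f)}_N] \in \mathbb{R}^{D_f \times N}$, $X^{(v)} = [x^{(v)}_1, \ldots, x^{(v)}_N] \in \mathbb{R}^{D_v \times N}$, and $X^{(s)} = [x^{(s)}_1, \ldots, x^{(s)}_N] \in \mathbb{R}^{D_s \times N}$ represent fMRI activity patterns, visual features, and semantic features, respectively. Here, $D_f$, $D_v$, and $D_s$ denote the dimensions of each sample of $X^{(f)}$, $X^{(v)}$, and $X^{(s)}$, respectively, and $N$ denotes the number of training samples. Visual features $X^{(v)}$ are extracted from viewed images via VGG19 [35], which is a CNN model that was used in a previous neural decoding study [36]. Semantic features $X^{(s)}$ are extracted from viewed image categories based on a word2vec model [37]. Furthermore, suppose that $[x^{(v)}_{N+1}, \ldots, x^{(v)}_{N+M}] \in \mathbb{R}^{D_v \times M}$ and $[x^{(s)}_{N+1}, \ldots, x^{(s)}_{N+M}] \in \mathbb{R}^{D_s \times M}$ represent additional visual and semantic features from images related to the target categories, respectively. Here, a set of additional visual and semantic features is obtained from each category, i.e., $M$ denotes the number of additional image categories. Our approach considers unobserved fMRI activity patterns corresponding to additional visual and semantic features as missing values. These missing values $X^{(f)}_{\mathrm{miss}} = [x^{(f)}_{N+1}, \ldots, x^{(f)}_{N+M}] \in \mathbb{R}^{D_f \times M}$ are estimated in a fully probabilistic manner [22] while we are optimizing the model parameters. Hereafter, the observed and missing variables are redefined as
$$\tilde{X}^{(f)} = [X^{(f)}, X^{(f)}_{\mathrm{miss}}] \in \mathbb{R}^{D_f \times (N+M)}, \quad \tilde{X}^{(v)} = [X^{(v)}, x^{(v)}_{N+1}, \ldots, x^{(v)}_{N+M}] \in \mathbb{R}^{D_v \times (N+M)}, \quad \tilde{X}^{(s)} = [X^{(s)}, x^{(s)}_{N+1}, \ldots, x^{(s)}_{N+M}] \in \mathbb{R}^{D_s \times (N+M)}.$$
Note that $[\cdot, \cdot]$ represents the horizontal concatenation. We also introduce latent variables $Z = [z_1, \ldots, z_{N+M}] \in \mathbb{R}^{D_z \times (N+M)}$ shared by fMRI activity $\tilde{X}^{(f)}$, visual features $\tilde{X}^{(v)}$, and semantic features $\tilde{X}^{(s)}$, where $D_z$ is the dimension of each sample of $Z$. In our model, $D_z$ was set to the smallest dimension of $\tilde{X}^{(f)}$, $\tilde{X}^{(v)}$, and $\tilde{X}^{(s)}$. The prior distributions of the missing values $X^{(f)}_{\mathrm{miss}}$ and latent variables $Z$ are initialized by random variables following a multivariate normal distribution as
$$p_0(X^{(f)}_{\mathrm{miss}}) = \prod_{m=1}^{M} \mathcal{N}(x^{(f)}_{N+m} \mid \mathbf{0}, I_{D_f}), \tag{1}$$
$$p_0(Z) = \prod_{n=1}^{N+M} \mathcal{N}(z_n \mid \mathbf{0}, I_{D_z}), \tag{2}$$
where $\mathcal{N}(x \mid \mu, \Sigma)$ represents a multivariate Gaussian distribution with a mean $\mu$ and a covariance matrix $\Sigma$. $\mathbf{0}$ denotes the zero vector and $I_D$ denotes the identity matrix of size $D \times D$. Our proposed model assumes that fMRI activity $\tilde{X}^{(f)}$, visual features $\tilde{X}^{(v)}$, and semantic features $\tilde{X}^{(s)}$ are generated from shared latent variables $Z$ by the following likelihood function:
$$p(\tilde{X}^{(k)} \mid W^{(k)}, Z) = \prod_{n=1}^{N+M} \mathcal{N}(\tilde{x}^{(k)}_n \mid W^{(k)} z_n, \beta_k^{-1} I_{D_k}), \tag{3}$$
where $k \in \{f, v, s\}$.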
In Equation (3), $W^{(k)} \in \mathbb{R}^{D_k \times D_z}$ is a projection matrix that connects $Z$ with $\tilde{X}^{(k)}$, and $\beta_k^{-1}$ is the scalar variable representing the noise variance of $\tilde{X}^{(k)}$. The prior distribution of the projection matrix $W^{(k)}$ and the hyperprior distribution of $\alpha^{(k)}$ are assumed as
$$p_0(W^{(k)} \mid \alpha^{(k)}) = \prod_{i=1}^{D_k} \prod_{j=1}^{D_z} \mathcal{N}(W^{(k)}_{(i,j)} \mid 0, \alpha^{(k)-1}_{(i,j)}), \tag{4}$$
$$p_0(\alpha^{(k)}) = \prod_{i=1}^{D_k} \prod_{j=1}^{D_z} \mathcal{G}(\alpha^{(k)}_{(i,j)} \mid \bar{\alpha}^{(k)}_{0(i,j)}, \gamma^{(k)}_{0(i,j)}), \tag{5}$$
where $\alpha^{(k)-1}_{(i,j)}$ is the variance of the $(i,j)$th element of $W^{(k)}$. In Equation (5), we introduce a hyperprior distribution for the inverse variances $\alpha^{(k)}$, and $\mathcal{G}(\cdot \mid \bar{\alpha}, \gamma)$ represents the Gamma distribution with a mean $\bar{\alpha}$ and a shape parameter $\gamma$. In our model, all of the means $\bar{\alpha}^{(k)}_{0(i,j)}$ and the shape parameters $\gamma^{(k)}_{0(i,j)}$ were set to 1 and 0, respectively, as in the previous study [19]. This form of prior and hyperprior distributions is motivated by the idea of automatic relevance determination (ARD). Since ARD automatically selects important components in the model, we can extract a small number of appropriate features from observed variables. In particular, this sparse representation is effective for fMRI data analysis since fMRI data have thousands of voxels and it is reported that only a small number of voxels is relevant to a visual stimulus [19,38]. This ARD form avoids overfitting due to the high dimensionality and improves generalization performance. Finally, the prior distribution of the inverse variance $\beta_k$ is assumed to be an uninformative prior described as $p_0(\beta_k) = \beta_k^{-1}$. The graphical model of the proposed model is illustrated in Figure 3. The meaning and role of the observed variables and model parameters are summarized in Table 1.
Table 1 (excerpt). Unobserved fMRI activity: fMRI activity corresponding to additional features.
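The generative assumption of this subsection (three views generated linearly from one shared latent variable with view-specific isotropic noise) can be sketched by sampling from it. This is a minimal illustration, not the authors' code: the function name `sample_views`, the fixed scalar `alpha` (used in place of the Gamma hyperprior), and the toy dimensions are assumptions.

```python
import numpy as np


def sample_views(n_samples, dims, d_z, beta, alpha=1.0, seed=0):
    """Draw samples from a linear-Gaussian multi-view model.

    dims: dict mapping view name ("f", "v", "s") to its dimension D_k.
    beta: dict mapping view name to its noise precision (inverse variance).
    alpha: fixed precision of the projection weights (assumption: a constant
           stands in for the ARD hyperprior).
    """
    rng = np.random.default_rng(seed)
    # Shared latent variables, one column per sample: z_n ~ N(0, I).
    Z = rng.standard_normal((d_z, n_samples))
    W, views = {}, {}
    for k, d_k in dims.items():
        # Projection matrix with element-wise variance 1 / alpha.
        W[k] = rng.standard_normal((d_k, d_z)) / np.sqrt(alpha)
        # Isotropic Gaussian noise with variance 1 / beta_k.
        noise = rng.standard_normal((d_k, n_samples)) / np.sqrt(beta[k])
        views[k] = W[k] @ Z + noise
    return Z, W, views
```

For example, `sample_views(6, {"f": 20, "v": 8, "s": 5}, d_z=5, beta={"f": 1.0, "v": 4.0, "s": 4.0})` returns a latent matrix of shape (5, 6) and one matrix per view whose columns share those latent variables.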

Optimization of Model Parameters via Variational Inference
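In this subsection, we estimate the missing values $X^{(f)}_{\mathrm{miss}}$ and optimize the model parameters $Z$, $W^{(k)}$, $\alpha^{(k)}$, and $\beta_k$ ($k \in \{f, v, s\}$). Hereafter, the observed variables (except for the missing values), the projection matrices, and the inverse variances are summarized as $X = \{X^{(f)}, \tilde{X}^{(v)}, \tilde{X}^{(s)}\}$, $W = \{W^{(f)}, W^{(v)}, W^{(s)}\}$, and $\Theta = \{\alpha^{(f)}, \alpha^{(v)}, \alpha^{(s)}, \beta_f, \beta_v, \beta_s\}$, respectively. We optimize $X^{(f)}_{\mathrm{miss}}$, $Z$, $W$, and $\Theta$ by calculating a posterior distribution $p(X^{(f)}_{\mathrm{miss}}, Z, W, \Theta \mid X)$. Since this posterior distribution cannot be calculated analytically, we approximate it by using variational inference [39]. In variational inference, we consider an approximate distribution $q(X^{(f)}_{\mathrm{miss}}, Z, W, \Theta)$ and a lower bound $L(q)$ on the log marginal likelihood. The distributions $q(X^{(f)}_{\mathrm{miss}})$, $q(Z)$, $q(W^{(k)})$, $q(\alpha^{(k)})$, and $q(\beta_k)$ ($k \in \{f, v, s\}$) are updated iteratively to maximize the lower bound $L(q)$.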
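For reference, the lower bound $L(q)$ can be written in the standard variational form shown below. This is the generic textbook identity, not a reconstruction of the paper's own numbered equation; the factorization in the last line is an assumption inferred from the list of factors that are updated iteratively.

```latex
\begin{aligned}
\ln p(X) &= L(q) + \mathrm{KL}\!\left( q \,\middle\|\, p(X^{(f)}_{\mathrm{miss}}, Z, W, \Theta \mid X) \right), \\
L(q) &= \int q(X^{(f)}_{\mathrm{miss}}, Z, W, \Theta)\,
        \ln \frac{p(X, X^{(f)}_{\mathrm{miss}}, Z, W, \Theta)}{q(X^{(f)}_{\mathrm{miss}}, Z, W, \Theta)}
        \; dX^{(f)}_{\mathrm{miss}}\, dZ\, dW\, d\Theta, \\
q(X^{(f)}_{\mathrm{miss}}, Z, W, \Theta) &= q(X^{(f)}_{\mathrm{miss}})\, q(Z) \prod_{k \in \{f, v, s\}} q(W^{(k)})\, q(\alpha^{(k)})\, q(\beta_k).
\end{aligned}
```

Because the Kullback-Leibler term is non-negative, maximizing $L(q)$ with respect to one factor at a time brings the approximate distribution closer to the true posterior.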
First, we update the distribution of the projection matrix $q(W^{(k)})$ according to variational inference by Equation (7), whose parameters are given by Equations (8) and (9). In Equations (8) and (9), $\bar{z}_{n(j)}$ represents the mean of the $j$th variable of $z_n$. Next, the distribution of the latent variables $q(Z)$ is updated by Equation (10), whose parameters are given by Equations (11) and (12). Furthermore, the distribution of the missing values $q(X^{(f)}_{\mathrm{miss}})$ is updated. In Equation (14), we estimate the missing values, i.e., unobserved fMRI activity patterns, by using the current model parameters $W^{(f)}$, $Z$, and $\beta_f$. Finally, we update the distributions of the inverse variances $q(\alpha^{(k)})$ and $q(\beta_k)$ ($k \in \{f, v, s\}$). The distribution $q(\alpha^{(k)})$ is updated by Equation (15); in our model, $\sigma^{(k)-1}_{0(i,j)}$ was set to 0 as in [19]. The distribution $q(\beta_k)$ is updated by Equation (18). These update procedures of the model parameters are summarized in Algorithm 1.

Algorithm 1 Update procedures of model parameters
Initialize $X^{(f)}_{\mathrm{miss}}$, $Z$, $W^{(k)}$, $\alpha^{(k)}$, and $\beta_k$ ($k \in \{f, v, s\}$) by prior distributions in Equations (1), (2), (4), (5)
for number of training iterations do
    Update the projection matrix $W^{(k)}$ in Equation (7)
    Update the shared latent variables $Z$ in Equation (10)
    Update the missing values $X^{(f)}_{\mathrm{miss}}$ in Equation (14)
    Update the inverse variances $\alpha^{(k)}$ in Equation (15)
    Update the inverse variance $\beta_k$ in Equation (18)
end for
Output: the estimated $X^{(f)}_{\mathrm{miss}}$, $Z$, $W^{(k)}$, $\alpha^{(k)}$, and $\beta_k$
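The control flow of Algorithm 1 can be illustrated with a deliberately simplified point-estimate analogue. The sketch below is not the variational Bayes procedure of this paper: it fixes every noise precision to 1, replaces the ARD prior with a constant ridge weight `alpha`, and keeps only means (no posterior variances). The function name `fit_point_estimate` and all defaults are assumptions; the sketch only shows how the missing fMRI columns are re-imputed from the shared latent variables in each iteration.

```python
import numpy as np


def fit_point_estimate(X_f, X_v_all, X_s_all, d_z, n_iters=10, alpha=1.0, seed=0):
    """Alternating updates with imputation of unobserved fMRI columns.

    X_f:     (D_f, N)   observed fMRI activity.
    X_v_all: (D_v, N+M) visual features (paired + additional).
    X_s_all: (D_s, N+M) semantic features (paired + additional).
    """
    rng = np.random.default_rng(seed)
    d_f, n = X_f.shape
    total = X_v_all.shape[1]
    # Random initialization of missing fMRI values and latent variables.
    X_miss = rng.standard_normal((d_f, total - n))
    Z = rng.standard_normal((d_z, total))
    eye = np.eye(d_z)
    for _ in range(n_iters):
        views = {"f": np.hstack([X_f, X_miss]), "v": X_v_all, "s": X_s_all}
        # Projection matrices: ridge regression of each view on Z.
        gram = Z @ Z.T + alpha * eye
        W = {k: np.linalg.solve(gram, Z @ X.T).T for k, X in views.items()}
        # Latent variables: posterior mean under z ~ N(0, I) and unit noise.
        precision = eye + sum(w.T @ w for w in W.values())
        rhs = sum(W[k].T @ views[k] for k in views)
        Z = np.linalg.solve(precision, rhs)
        # Missing fMRI values: projection of the latent variables of the
        # additional samples through the fMRI projection matrix.
        X_miss = W["f"] @ Z[:, n:]
    return W, Z, X_miss
```

In this simplification the additional visual and semantic features influence the fMRI projection matrix only through the imputed columns, which mirrors the role of the missing values in the full model.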

Decoding of Viewed Image Categories from fMRI Activity
After the optimization of model parameters, image categories are decoded from fMRI activity evoked by test visual stimuli, i.e., unobserved images. In zero-shot neural decoding settings, the categories of test visual stimuli are not included in the training categories.
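Suppose that $x^{(f)}_{\mathrm{test}} \in \mathbb{R}^{D_f}$ represents fMRI activity evoked by a test visual stimulus. The predictive distribution of the visual and semantic features $x^{(i)}_{\mathrm{test}}$ ($i \in \{v, s\}$) is constructed from the likelihood function $p(x^{(i)}_{\mathrm{test}} \mid W^{(i)}, z_{\mathrm{test}})$, the distribution of the projection matrix $q(W^{(i)})$, and a posterior distribution $p(z_{\mathrm{test}} \mid x^{(f)}_{\mathrm{test}})$ as follows:
$$p(x^{(i)}_{\mathrm{test}} \mid x^{(f)}_{\mathrm{test}}) = \iint p(x^{(i)}_{\mathrm{test}} \mid W^{(i)}, z_{\mathrm{test}})\, q(W^{(i)})\, p(z_{\mathrm{test}} \mid x^{(f)}_{\mathrm{test}})\, dW^{(i)}\, dz_{\mathrm{test}}. \tag{21}$$
In Equation (21), $z_{\mathrm{test}}$ is a latent variable shared by $x^{(f)}_{\mathrm{test}}$, $x^{(v)}_{\mathrm{test}}$, and $x^{(s)}_{\mathrm{test}}$. Since $p(z_{\mathrm{test}} \mid x^{(f)}_{\mathrm{test}})$ is an unknown distribution, we approximate it based on the distribution of the latent variables $q(Z)$ (Equations (10)-(12)). We derive an approximate distribution $\tilde{q}(z_{\mathrm{test}})$ by omitting the terms with respect to visual features and semantic features in Equations (10)-(12). Since the multiple integral of Equation (21) cannot be calculated analytically, we replace $q(W^{(i)})$ with the estimate $\bar{W}^{(i)}$ obtained from Equation (9). Moreover, $p(x^{(i)}_{\mathrm{test}} \mid W^{(i)}, z_{\mathrm{test}})$ is obtained from the likelihood function in Equation (3). With these approximations, the predictive distribution in Equation (21) can be evaluated analytically. Finally, the predicted visual and semantic features $x^{(i)}_{\mathrm{test}}$ ($i \in \{v, s\}$) are obtained by averaging the samples $x^{(i)}_{\mathrm{test},t}$ over many sets of model training and prediction:
$$x^{(i)}_{\mathrm{test}} = \frac{1}{T} \sum_{t=1}^{T} x^{(i)}_{\mathrm{test},t}.$$
In our analysis, the number of sets of model training and prediction was set to $T = 100$. These predicted features are compared with visual and semantic features obtained from various candidate categories. Specifically, we calculate the Pearson correlation coefficient $r^{(v)}$ between $x^{(v)}_{\mathrm{test}}$ and visual features obtained from a candidate category as in previous studies [6,8]. We also calculate the Pearson correlation coefficient $r^{(s)}$ between $x^{(s)}_{\mathrm{test}}$ and semantic features obtained from a candidate category. $r^{(v)}$ and $r^{(s)}$ are calculated separately from visual and semantic features, respectively. The values of $r^{(v)}$ and $r^{(s)}$ range from −1 to 1, and higher values indicate that these features are similar. Then these correlation coefficients are integrated as $r^{(v+s)} = \eta \cdot r^{(v)} + (1 - \eta) \cdot r^{(s)}$, where $\eta$ ($0 \le \eta \le 1$) is the trade-off parameter balancing visual and semantic features.
The viewed image category is decoded by identifying the candidate category with the maximum correlation coefficient $r^{(v+s)}$.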
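The final decoding step (correlating predicted features with candidate features in each view, mixing the two correlations with the trade-off parameter, and taking the maximum) can be sketched in plain Python. The function names `pearson` and `decode_category` and the dictionary layout of the candidates are assumptions made for illustration; the sketch also assumes that no feature vector is constant, since the correlation is undefined in that case.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def decode_category(pred_v, pred_s, cand_v, cand_s, eta=0.5):
    """Return the best candidate and all combined scores.

    cand_v / cand_s: dicts mapping a category name to its visual / semantic
    feature vector. eta weights the visual correlation, (1 - eta) the
    semantic correlation.
    """
    scores = {}
    for name in cand_v:
        r_v = pearson(pred_v, cand_v[name])
        r_s = pearson(pred_s, cand_s[name])
        scores[name] = eta * r_v + (1.0 - eta) * r_s
    best = max(scores, key=scores.get)
    return best, scores
```

Setting `eta=1.0` uses the visual feature space only and `eta=0.0` the semantic feature space only, which corresponds to the two extremes of the trade-off.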

Dataset
We utilized a publicly available fMRI dataset [6] in this study. This dataset includes fMRI activity patterns measured when five subjects viewed images collected from ImageNet [40]. In this dataset, fMRI activity patterns that were measured when each subject viewed 1200 images from 150 categories (eight images per category) were collected for the training data. Furthermore, fMRI activity patterns that were measured when each subject viewed 50 images from 50 categories (one image per category), which differ from training categories, were collected for the test data. Since test fMRI activity was measured 35 times for each test image, we used the average fMRI activity patterns across all trials, as in [6]. We used fMRI activity with approximately 4500-dimensional (the number of voxels) vectors from the visual cortex (V1-V4, the lateral occipital complex, fusiform face area, and parahippocampal place area) defined in the study [6]. Visual features with 512-dimensional vectors were extracted from images via the last convolution layer with an average pooling of VGG19 that was pre-trained on ImageNet. Semantic features with 300-dimensional vectors were obtained from image categories via a word2vec model that was pre-trained on a Google News dataset. When the categories were not in the word2vec corpus, we utilized hypernyms on ImageNet whose semantic features were available. For example, since the category 'beer mug' was not in the corpus, we utilized 'mug', which is a hypernym of 'beer mug'.

Conditions
To validate the effectiveness of the semi-supervised multi-view generative model, we compared the proposed model with the following seven methods: four existing neural decoding methods [6][7][8][14], two BCCA [19] variants, and the multi-view generative model (MGM). The methods in [6,8] predicted visual features of viewed images from fMRI activity via sparse logistic regression (SLR) [38] and multilayer perceptrons (MLP) [41], respectively. The method in [14] projected fMRI activity and visual features into the same latent space by CCA [25]. The method in [7] constructed semi-supervised fuzzy discriminative CCA (Semi-FDCCA) that incorporated similarities of image categories and visual features from various images into the framework of CCA. BCCA and MGM are in the same framework as our proposed model. BCCA assumes that the number of observed variables is two. Here, we constructed the following two BCCA methods: BCCA-V and BCCA-S. The observed variables of BCCA-V are fMRI activity and visual features, and those of BCCA-S are fMRI activity and semantic features. MGM associated fMRI activity with visual features and semantic features, i.e., the proposed model without a semi-supervised learning scenario. We verified the effectiveness of the multi-view approach by comparing ours with BCCA-V and BCCA-S, and verified the effectiveness of the semi-supervised approach by comparing ours with MGM. In all methods, we standardized fMRI activity, visual features, and semantic features before the training and testing phases. In BCCA-V, BCCA-S, MGM, and the proposed model, the number of training iterations was set to 10. In MGM and the proposed model, the trade-off parameter η was set to the best value from η ∈ {0, 0.1, ..., 1} for each model.
In this experiment, we used identification accuracy to evaluate the decoding performance as in previous studies [6,8]. This evaluation metric represents the accuracy for identifying its ground truth image category between the two candidate categories: one is the ground truth category and another is a randomly selected category (chance level being 50%). This identification was performed based on the correlation coefficient r (v+s) between predicted features and each candidate category feature. We prepared 10,000 candidate categories (including 50 test categories) randomly selected from ImageNet. Therefore, we performed this identification between all combinations of a ground truth category and one of the other 9999 candidate categories. Furthermore, we also evaluated by using rank-n accuracy to confirm where the correct category is ranked in all candidates. Rank-n accuracy refers to the ratio that shows that a correct category is ranked at top-n ranks across 10,000 candidates based on the correlation coefficient r (v+s) .
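The two evaluation metrics described above can be sketched as follows. The function names and the convention that ties count as failures are assumptions of this sketch; `scores` maps each candidate category to its combined correlation coefficient for one test sample.

```python
def identification_accuracy(scores, true_name):
    """Fraction of pairwise comparisons won by the ground truth category.

    Each comparison pits the ground truth against one other candidate;
    the ground truth wins when its score is strictly higher (assumption:
    ties count as failures).
    """
    true_score = scores[true_name]
    others = [s for name, s in scores.items() if name != true_name]
    wins = sum(true_score > s for s in others)
    return wins / len(others)


def rank_of_true(scores, true_name):
    """Rank (1 = best) of the ground truth category among all candidates."""
    true_score = scores[true_name]
    return 1 + sum(s > true_score for s in scores.values())


def rank_n_accuracy(ranks, n):
    """Ratio of test samples whose ground truth is ranked within the top n."""
    return sum(r <= n for r in ranks) / len(ranks)
```

With 10,000 candidates, `identification_accuracy` averages over the 9999 pairwise comparisons of one test sample, so a value of 0.5 corresponds to the chance level.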

Results and Discussion
We evaluate the decoding performance in several zero-shot neural decoding settings where the source domain and the target domain are potentially unrelated. In this experiment, we classified categories used in the dataset [6] into eight target groups: mammal, bird, invertebrate, device, container, equipment, structure, and commodity. These groups are defined as subgroups in ImageNet. For zero-shot analysis, several categories that belong to each target group were excluded from the 150 training categories of the dataset [6], and the remaining categories were used for training. Moreover, several categories that belong to each target group were collected from 50 test categories of the dataset [6], and fMRI activity patterns corresponding to these categories were used for test fMRI activity. Furthermore, the proposed model incorporated additional image categories that belong to each target group, which were collected for as many categories as possible from ImageNet. As additional visual features, we used the average visual features extracted from multiple images of an additional category. The numbers of training, test, and additional categories for each target group are described in Table 2. Note that we have eight training samples per training category.
Decoding performance
Table 3 shows the identification accuracy for each target group. The accuracy was averaged across five subjects. As shown in this table, the proposed model outperforms the seven other methods. By comparing our model with MGM, we confirmed the effectiveness of the semi-supervised approach that employs additional image categories belonging to target groups. We also confirmed that the identification accuracy of SLR and BCCA-V for mammal is remarkably low, whereas the accuracy of BCCA-S is comparatively high. These results indicate that mammal can be decoded more successfully from the semantic feature space than the visual feature space. On the other hand, the accuracy for other target groups such as container and structure from the visual feature space is higher than that of the semantic feature space. Therefore, considering a multi-view embedding space is effective since the better feature space depends on each target group. Table 4 presents the rank-n (n = 100, 1000, 5000) accuracy in 10,000 random candidate categories. From the table, we also confirmed that the rank accuracy of the proposed model is higher than that of comparative methods and outperforms the chance level.
In Table 3, we used 10,000 candidate categories when identification accuracy was calculated. This evaluation metric indicates how accurately the correct category can be identified from a large number of random categories. Here, we also validate how accurately the correct category can be identified from similar categories (i.e., target groups). Table 5 shows the identification accuracy of comparative methods and the proposed method when the candidate categories are in each target group. For example, when test categories of mammal were decoded, we performed identification between all combinations of the ground truth category and one of the candidate categories from mammal. As seen in Table 5, the identification accuracy of almost all target groups is degraded in comparison with Table 3, since it is more difficult to identify a ground truth category from similar candidate categories. Nevertheless, the proposed model outperforms comparison methods for all target groups. By comparing with MGM, we confirmed that the utilization of additional visual and semantic features from each target group contributes to the identification from similar candidate categories.
Table 3. Decoding performance in 10,000 random candidate categories. The first line for each method represents the identification accuracy (%) and the second line represents the standard error of five subjects. Existing neural decoding methods (SLR, CCA, Semi-FDCCA, and MLP), a part of the proposed model (BCCA-V, BCCA-S, and MGM), and the proposed model are separated by blocks.

Projection domain shift problem
To examine the projection domain shift problem, we visualized the L2 norm between target group embedding and target group prototype in the visual and semantic feature spaces, as seen in Figure 4. Target group embedding is the average vector of embedding features of categories from each target group, and target group prototype is the average vector of ground truth values of categories from each target group. The L2 norm was averaged across five subjects. CCA [14] and Semi-FDCCA [7] were not considered in Figure 4 since these methods calculate the distance between features in the canonical space. From Figure 4, the L2 norm of the proposed model is smaller than that of comparative methods in both the visual and semantic feature spaces. This result shows that the semi-supervised approach reduces the embedding biases towards source categories and realizes better target embeddings towards their prototypes. This alleviation of the projection domain shift problem is effective not only in identifying from a large number of random categories (see Table 3), but also in identifying from similar categories (see Table 5). Note that identification accuracy tends to improve when the L2 norm is small; however, the L2 norm is not perfectly consistent with identification accuracy. This is because the L2 norm is calculated purely on the distance of the embedding features, while the identification accuracy is calculated by comparing the embedding features with other candidate features. Since the distribution of candidate features is not uniform but depends on the target group, the relationship of the L2 norm between target groups is not consistent with identification accuracy.
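The shift measure used in this analysis (the L2 norm between the average embedding and the average prototype of a target group) can be sketched in a few lines. The function names `mean_vector` and `group_shift` are hypothetical, and the sketch handles a single subject and a single feature space; averaging across subjects would be done on the returned values.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def group_shift(embeddings, prototypes):
    """L2 norm between the group embedding and the group prototype.

    embeddings: predicted feature vectors of the categories in a target group.
    prototypes: ground truth feature vectors of the same categories.
    """
    return math.dist(mean_vector(embeddings), mean_vector(prototypes))
```

A smaller value indicates that the predicted features of a target group are, on average, closer to their ground truth features in that feature space.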
Normal setting vs. Zero-shot setting
To confirm the performance limit in the zero-shot neural decoding setting, we evaluate the proposed model in the normal setting, where target groups are included in the training set (i.e., the original setting of the dataset [6]). Table 6 shows the decoding performance of our model when each target group is included in the training set (normal) and when it is not (zero-shot). The results are averaged across five subjects. As expected, the performance in the normal setting is higher than in the zero-shot setting for all target groups. Although a gap remains between the two settings, the proposed method achieves comparable performance for several target groups (e.g., bird, container, and equipment).

Balance between visual and semantic features
We investigate the balance between visual and semantic features in decoding each target group. Tables 7 and 8 show the best trade-off parameters η, which represent the relative importance of visual and semantic features, in the normal and zero-shot settings, respectively. In the tables, a large value means that visual features are more important, while a small value means that semantic features are more important. Table 7 shows that, in the normal setting, the visual and semantic features are equally important in all target groups (i.e., η is around 0.5). On the other hand, Table 8 shows that, in the zero-shot setting, η changes from its normal-setting value in some target groups (e.g., mammal, invertebrate, and device). In the living-thing groups (i.e., mammal, bird, and invertebrate), η tends to become smaller, i.e., the importance of semantic features increases. In the artifact groups (i.e., device, container, equipment, structure, and commodity), η tends to become larger, i.e., the importance of visual features increases. The results suggest that the visual features of living things and the semantic features of artifacts suffer from the projection domain shift problem in the zero-shot setting. Therefore, the other view (i.e., semantic features for living things and visual features for artifacts), which is less affected by the problem, contributes to zero-shot neural decoding.
Different additional group than the target group
In the above results, we utilized additional image categories belonging to each target group. Here, we discuss the decoding performance when the additional group differs from the target group. Figure 5 shows the improvement in identification accuracy of the proposed model over MGM for all combinations of additional and target groups. The accuracy improvement was averaged across five subjects.
From the result, we confirmed that using the same additional group as the target group gives the best performance, and that accuracy also improves when the additional group is similar to the target group (e.g., device and structure). On the other hand, the utilization of other additional groups degrades the decoding performance when the target group is equipment. Thus, as expected, it is important to add categories that are relevant to the target group.
Effect on the decoding performance of the source group
We investigate the decoding performance of the source group when each additional group is incorporated. Figure 6 shows the decoding performance of the source group for each target and additional group (the additional group is the same as the target group) in MGM and the proposed model. Note that the source group consists of the other seven groups, i.e., all groups except the target and additional group. The result indicates that the utilization of additional image categories slightly improves the decoding performance of the source group. Therefore, these results suggest that the semi-supervised approach contributes to the decoding of the target group without negatively affecting the decoding of the source group.

Conclusions
This paper has proposed a semi-supervised multi-view embedding approach for zero-shot neural decoding. In zero-shot neural decoding, the training data are scarce because of the difficulty in collecting brain activity patterns; therefore, the projection domain shift problem between the source domain and the target domain becomes pronounced. We address this problem by introducing a semi-supervised approach that employs additional image categories related to the target domain. Furthermore, to exploit complementary information, we project fMRI activity patterns into both the visual feature space and the semantic feature space. The experimental results show the advantages of the proposed model over existing methods.
Given the difficulty in collecting fMRI activity, we incorporate additional visual and semantic features from abundant images while estimating the corresponding fMRI activity patterns. This framework can be applied not only to neural decoding tasks but also to other difficult learning tasks in which data of one modality are scarce while data of other modalities are abundant.